First things, first. Let’s load all the necessary libraries we’ll need.
library(tidyverse) # Load all the necessary tidy tools packages for data manipulation/cleaning
library(summarytools) # See summary of data
library(skimr) # View summary statistics
library(here) # Find current working directory and directories of files
library(janitor) # R functions to find duplicate values and create crosstables
library(glue) # paste text values togethers
You have one of two options to load survey data. You can use the File drop down, then click on Import Dataset, then From Text (readr) and follow the UI to import the data.
OR, if you know the file path, you can read the data set using the code below
#store the file path name
file_name <- "survey_file.csv"
#read in the survey data from the 'data' folder and name the dataframe `df`
df <- read_csv(here::here("data", file_name))
Let’s preview the survey data
# Uncomment this code to preview the data
# View(df)
We can see that the first two rows are unnecessary to reading the data. When we load the data, we can tell R to skip adding those first two lines into the data frame.
Try this with the Import UI, and/or try doing this by reading the readr documentation and writing your own code
?read_csv
df_raw <- read_csv(here::here("data", file_name),
skip = 2)
# View(df_raw)
We’ve managed to exclude the metadata from our data frame, but now the column names don’t seem to make sense…
What we’ll need to do instead is store the column names in a separate variable, then read it back in to the data frame.
# Store column names
df_names <- read_csv(here::here("data", file_name), n_max=0) %>% names()
# Read the entire file
df_raw <- read_csv(here::here("data", file_name),
col_names = df_names, # use df_names to title the columns
skip = 3) # skip the first three lines
# View(df_raw)
Let’s use skim() to get an overview of our data set
# Use this block to run skim on the `df` data frame
Now, let’s try using the package summarytools to view our data set. First try the function dfSummary(), then try view(dfSummary())
# dfSummary(df_raw)
# same output, but more readable in the Viewer pane
# view(dfSummary(df_raw))
What do you notice when looking at the report?
From the report you can see the min and max of the start dates for the survey, Q4 has some missing values, Q6 has missing values, 41% of the responses came from ci = 1, and there are 12 of the same emails in column e.
This report is helpful in seeing if there are any anomalies or large missing values in the data set. Sometimes this report will catch whether we’ve read the data incorrectly and stop us before we analyze a broken data set.
The column ci might not be easily understood by other people trying to read your code. Let’s rename the column to something that’s more readable.
df_raw <-
df_raw %>%
rename(city_id = ci)
#run names() to check the column names of the data set
names(df_raw)
## [1] "StartDate" "EndDate"
## [3] "Status" "IPAddress"
## [5] "Progress" "Duration (in seconds)"
## [7] "Finished" "RecordedDate"
## [9] "ResponseId" "RecipientLastName"
## [11] "RecipientFirstName" "RecipientEmail"
## [13] "ExternalReference" "LocationLatitude"
## [15] "LocationLongitude" "DistributionChannel"
## [17] "UserLanguage" "Q1"
## [19] "Q2" "Q3_1"
## [21] "Q3_2" "Q3_3"
## [23] "Q3_4" "Q3_5"
## [25] "Q3_6" "Q4"
## [27] "Q4_10_TEXT" "Q5"
## [29] "Q6" "Q6_4_TEXT"
## [31] "Q7_1" "Q7_2"
## [33] "Q7_3" "Q7_4"
## [35] "Q7_5" "Q7_6"
## [37] "Q7_6_TEXT" "Q8"
## [39] "city_id" "e"
From the summary report, it also looks like there are a few duplicate emails. Let’s count how many duplicates there are.
df_raw %>%
group_by(e) %>%
count() %>%
arrange(desc(n))
This time, let’s use the janitor package to find duplicate emails.
df_raw %>%
get_dupes(e)
Let’s delete the duplicate emails and keep the first response from the duplicated emails.
#Remove duplicate emails, keep the first submission date
df_raw <- df_raw %>%
group_by(e) %>%
slice(which.min(StartDate)) %>%
ungroup()
Let’s double check that we’ve removed all of the duplicates. We’d expect get_dupes() to return 0 rows.
df_raw %>%
get_dupes(e)
We can also see that the number of observations in df_raw is now 1025, meaning 32 of the duplicate emails have been removed.
We also have some demographic data of the respondents that lives in the data folder in the file called, “distinct_demos.csv”
Let’s read in the distinct_demo’s csv as “demo_data”
demo_data <- read_csv(here::here("data", "distinct_demos.csv"))
Now, let’s create one merged table by joining the df_raw table with the demo_data table and call it df
#Uncomment the code and run it
# df_raw %>%
# left_join(demo_data)
Why did this error out? Turns out R doesn’t know what unique key to use to join the two tables. We can either rename the e column in df_raw to email or we can tell the left_join() function which columns to merge on.
#option 1: rename column `e` to `email`
df_raw %>%
rename(email = e) %>%
left_join(demo_data)
#option 2: tell left_join which columns to join by
df_raw %>%
left_join(demo_data, by = c("e" = "email"))
#lets store the joined data sets as df_raw
df_raw <-
df_raw %>%
left_join(demo_data, by = c("e" = "email"))
df_raw
Use dfSummary to quick view the data set.
# view(dfSummary(df_raw))
Now we can see df_raw has a few new columns like gender, age, and firstname.
Let’s analyze Question 1!
First, let’s store a function that calculates the margin of error where p = sample proportion, n = sample size, and z is the critical value (z-score).
calc_moe <- function(p,n, z=1.96){
x <- z*sqrt(p*(1-p)/n)
return(x)
}
Now, let’s calculate a frequency table of the responses from Question 1, with the margin of error for each response.
Q1_table <- df_raw %>%
group_by(Q1) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
proportion = (n/total),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe
)
Q1_table
Let’s plot the frequency table using ggplot!
Q1_table %>%
ggplot(aes(x = Q1, y = proportion)) +
geom_col()
Hmm, looks like we want to reorder the bars from the order in which the response options were shown.
Let’s re-level the response options to give them order.
Q1_table <-
Q1_table %>%
mutate(Q1 =
fct_relevel(Q1,
"Daily",
"2-6 times a week",
"Once a week",
"2-3 times a month",
"Once a month or less")
)
Q1_table %>%
ggplot(aes(x = Q1, y = proportion)) +
geom_col()
Now let’s get fancy. Let’s color the bars, add error bars, and add labels to the bars and titles to the graph.
To understand what each line of code does, try commenting out a line and rerunning the code. Can you spot what’s different each time you run it?
Q1_table %>%
ggplot(aes(x = Q1, y = proportion)) +
geom_col(fill = "#EA7580") +
geom_errorbar(
aes(ymin = lower, ymax = upper),
size = .5,
width = 0,
position = position_dodge(.7)
) +
geom_text(aes(label = paste0(round(proportion*100), "%"), y = upper), vjust = -.3) +
labs (
title = "Q1",
subtitle = "How often do you order food delivery?",
caption = glue("Error bars represent 95% confidence intervals\nn = {Q1_table$total}")
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
Now let’s do Q2.
Q2_table <- df_raw %>%
group_by(Q2) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
proportion = (n/total),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe
)
Q2_table
Q2_table %>%
ggplot(aes(x = Q2, y = proportion)) +
geom_col(fill = "#EA7580") +
geom_errorbar(
aes(ymin = lower, ymax = upper),
size = .5,
width = 0,
position = position_dodge(.7)
) +
geom_text(aes(label = paste0(round(proportion*100), "%"), y = upper), vjust = -.3) +
labs (
title = "Q2",
subtitle = "How many food delivery apps do you have installed on your phone?",
caption = glue("Error bars represent 95% confidence intervals\nn = {Q2_table$total}")
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
What do you notice that’s different about Q3?
That’s right, the values are stored over multiple columns… that’s because the question is multi-select. We’ll need to use different code from Q1 and Q2 to analyze Q3 because of the way the data is organized.
You’ll need to transform the data from wide to long before grouping and calculating the frequencies. (Remember when we learned about gather()?)
Q3_table <-
df_raw %>%
select(starts_with("Q3")) %>% # Select all columns that start with "Q3"
gather(question, response) %>% # Organize the columns into key-value pairs
group_by(question) %>%
count(question, response) %>%
mutate(total = sum(n),
proportion = (n/total),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe
) %>%
filter(!is.na(response)) #Show values that are not Null
Q3_table
Q3_table %>%
ggplot(aes(x = response, y = proportion)) +
geom_col(fill = "#EA7580") +
geom_errorbar(
aes(ymin = lower, ymax = upper),
size = .5,
width = 0,
position = position_dodge(.7)
) +
geom_text(aes(label = paste0(round(proportion*100), "%"), y = upper), vjust = -.3) +
labs (
title = "Q3",
subtitle = "Which of the following are reasons why you order food delivery? Select all that apply.",
caption = glue("Error bars represent 95% confidence intervals\nn = {Q3_table$total}")
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
Let’s reorder the bars by frequency
Q3_table %>%
ggplot(aes(x = reorder(response, -proportion), y = proportion)) +
geom_col(fill = "#EA7580") +
geom_errorbar(
aes(ymin = lower, ymax = upper),
size = .5,
width = 0,
position = position_dodge(.7)
) +
geom_text(aes(label = paste0(round(proportion*100), "%"), y = upper), vjust = -.3) +
labs (
x = "Q3",
title = "Q3",
subtitle = "Which of the following are reasons why you order food delivery? Select all that apply.",
caption = glue("Error bars represent 95% confidence intervals\nn = {Q3_table$total}")
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 18))
Let’s analyze Q4.
Q4_table <- df_raw %>%
group_by(Q4) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
proportion = (n/total),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe
)
Q4_table
Notice the last row is titled NA? A few counts are NA means a few respondents skipped this question. Let’s be sure to remove these null values from our total n size.
Q4_table <-
df_raw %>%
filter(!is.na(Q4)) %>% #this line removes the Null values
group_by(Q4) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
proportion = (n/total),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe
)
Q4_table
Q4_table %>%
ggplot(aes(x = reorder(Q4, -proportion), y = proportion)) +
geom_col(fill = "#EA7580") +
geom_errorbar(
aes(ymin = lower, ymax = upper),
size = .5,
width = 0,
position = position_dodge(.7)
) +
geom_text(aes(label = paste0(round(proportion*100), "%"), y = upper), vjust = -.3) +
labs (
x = "Q4",
title = "Q4",
subtitle = "Which of the following best describes your most recent food delivery cuisine?",
caption = glue("Error bars represent 95% confidence intervals\nn = {Q4_table$total}")
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
Now let’s look at Q5’s satisfaction question. We’ll want to relevel the response options so they are plotted in order.
Q5_table <- df_raw %>%
group_by(Q5) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
proportion = (n/total),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe
) %>%
mutate(Q5 =
fct_relevel(Q5,
"Extremely dissatisfied",
"Somewhat dissatisfied",
"Neither satisfied nor dissatisfied",
"Somewhat satisfied",
"Extremely satisfied")
)
Q5_table
Q5_table %>%
ggplot(aes(x = Q5, y = proportion)) +
geom_col(fill = "#EA7580") +
geom_errorbar(
aes(ymin = lower, ymax = upper),
size = .5,
width = 0,
position = position_dodge(.7)
) +
geom_text(aes(label = paste0(round(proportion*100), "%"), y = upper), vjust = -.3) +
labs (
title = "Q5",
subtitle = "How satisfied or dissatisfied are you with your most recent food delivery order?",
caption = glue("Error bars represent 95% confidence intervals\nn = {Q5_table$total}")
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20))
Sometimes our stakeholders are only interested in the Top 2 Boxes. Let’s calculate them by grouping the variables together. We can do this by using the
case_when() function.
Q5_table_top2 <-
df_raw %>%
mutate(Q5 = case_when(
Q5 %in% c("Extremely dissatisfied","Somewhat dissatisfied") ~ "Dissatisfied",
Q5 %in% c("Extremely satisfied", "Somewhat satisfied") ~ "Satisfied",
Q5 == "Neither satisfied nor dissatisfied" ~ "Neither"
)) %>%
group_by(Q5) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
proportion = (n/total),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe
)
Q5_table_top2
Q5_table_top2 %>%
ggplot(aes(x = Q5, y = proportion)) +
geom_col(fill = "#EA7580") +
geom_errorbar(
aes(ymin = lower, ymax = upper),
size = .5,
width = 0,
position = position_dodge(.7)
) +
geom_text(aes(label = paste0(round(proportion*100), "%"), y = upper), vjust = -.3) +
labs (
title = "Q5",
subtitle = "How satisfied or dissatisfied are you with your most recent food delivery order?",
caption = glue("Error bars represent 95% confidence intervals\nn = {Q5_table$total}")
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20))
If you remember that we used the
janitor package earlier, you’ll see they have a function to quickly analyze top2 boxes. Try the function top_levels()
top_levels(Q5_table$Q5)
top_levels(Q5_table$Q5, n = 1)
Let’s look at Q6
Q6_table <- df_raw %>%
group_by(Q6) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
proportion = (n/total),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe
)
Q6_table
Looks like we have respondents who skipped this question, too. We want the proportion of those who answered the question. So let’s remove those who skipped the question from our total n size.
Q6_table <- df_raw %>%
filter(!is.na(Q6)) %>% #remove null rows from this question
group_by(Q6) %>%
summarize(n = n()) %>%
mutate(total = sum(n),
proportion = (n/total),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe
)
Q6_table
Notice how the total is now smaller because we removed the null values?
Q6_table %>%
ggplot(aes(x = reorder(Q6, -proportion), y = proportion)) +
geom_col(fill = "#EA7580") +
geom_errorbar(
aes(ymin = lower, ymax = upper),
size = .5,
width = 0,
position = position_dodge(.7)
) +
geom_text(aes(label = paste0(round(proportion*100), "%"), y = upper), vjust = -.3) +
labs (
x = "Q6",
title = "Q6",
subtitle = "How did you complete your most recent food delivery order?",
caption = glue("Error bars represent 95% confidence intervals\nn = {Q6_table$total}")
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20))
Q7_table <-
df_raw %>%
select(starts_with("Q7"), -ends_with("TEXT")) %>% # Exclude the open text column of Q7
gather(question,
response) %>%
group_by(question) %>%
count(question,
response) %>%
mutate(
total = sum(n),
proportion = n/sum(n),
moe = calc_moe(proportion, total),
lower = proportion - moe,
upper = proportion + moe) %>%
filter(!is.na(response))
Q7_table
Q7_table %>%
ggplot(aes(x = reorder(response, -proportion), y = proportion)) +
geom_col(fill = "#EA7580") +
geom_errorbar(
aes(ymin = lower, ymax = upper),
size = .5,
width = 0,
position = position_dodge(.7)
) +
geom_text(aes(label = paste0(round(proportion*100), "%"), y = upper), vjust = -.3) +
labs (
x = "Q7",
title = "Q7",
subtitle = "How could your most recent food delivery experience been improved? Select all that apply.",
caption = glue("Error bars represent 95% confidence intervals\nn = {Q7_table$total}")
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20))
Use the code below to save this chart as a png so we can include it in a presentation.
Do you remember how to customize the size of the png?
ggsave("Q7_plot.png")